DATA1220-55, Fall 2024
2024-09-09
Could you put “DATA1220:” at the beginning of the subject line of your emails?
I have 3 email accounts, and this will help me spot and respond to yours more quickly.
Late policy: “This homework is due by 6:00pm on Monday, 9/9/24. No credit will be lost for assignments received by 7:00pm to account for issues with uploading. 10% of the points will be deducted from assignments received by 9:00am on Tuesday, 9/10/24. Assignments turned in after this point are only eligible for 50% credit, so it benefits you to turn in whatever you have completed by the due date.”
Data science pipeline priorities for Chapter 1
Describe the “shape” (i.e. distribution) of numerical variables
Calculate means, medians, modes, variances, standard deviations, IQRs
Learn the appropriate use of summary statistics (e.g. mean vs. median)
Characterize the relationship between 2 numerical variables
Analyze contingency (e.g. 2x2) tables
Summarize categorical variables with proportions
Compare numerical data between categorical groups
Recognize common visualization techniques / plots
Numerical: Dot plots, histograms, density plots, QQ plots, box plots, violin plots
Categorical: Bar plots, mosaic plots, tree maps
Build basic visualizations in R using ggplot2
Data visualization do’s and don’ts
Numerical variables can be continuous or discrete.
The “shape” of numerical data is called its distribution.
Location: the “center” of the data
Scale: the “spread” of the data
Commonly observed patterns in numerical distributions
Unimodal distributions have one peak around which observations cluster.
Bimodal distributions have 2 peaks around which observations cluster.
Trimodal distributions have 3 peaks around which observations cluster.
Symmetric distributions have left and right halves that are roughly mirror images of each other, so neither side has a longer “tail” of extreme values.
Left-skewed distributions have a long tail of extreme values at the low end of the data range, while most observations cluster at the high end.
Right-skewed distributions have a long tail of extreme values at the high end of the data range, while most observations cluster at the low end.
The location of a numerical variable’s distribution can be thought of as the “center” of the data, around which the bulk of the observations cluster.
Mean: the sum of all values divided by the number of observations (i.e. “average”)
Median: the value in the exact middle of the sorted data
Mode: the most common value in the data (for discrete variables)
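Base R has no built-in function for the statistical mode (the `mode()` function reports how an object is stored, not its most common value). One possible sketch uses `table()` to count values; the `siblings` data and the name `stat_mode` are made up for illustration.

```r
# Hypothetical example data: number of siblings reported by 9 students
siblings <- c(0, 1, 1, 2, 1, 3, 0, 1, 2)

# table() counts how often each distinct value occurs
counts <- table(siblings)

# Keep the value(s) with the highest count; names() returns text,
# so convert back to numbers
stat_mode <- as.numeric(names(counts)[counts == max(counts)])
stat_mode  # 1
```

If two values tie for the highest count, this sketch returns both of them.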
Where are the bulk of observations concentrated?
The sample mean \(\bar{x}\) is computed as the sum of all observed values \(\sum_{i=1}^n{x_i}\), where \(i\) is the observation number, divided by the total number of observations \(n\).
\[ \bar{x}=\frac{\sum_{i=1}^n x_i}{n} \]
or
\[ \bar{x}=\frac{\operatorname{sum}(x_1, x_2, ..., x_n)}{n}=\frac{x_1+x_2+...+x_n}{n} \]
Consider the numerical variable x.
You can calculate the mean manually…
Or you can use the mean() function.
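A minimal sketch of both approaches, assuming a small made-up vector `x` (these values are not from the course materials):

```r
# Hypothetical example data
x <- c(2, 4, 4, 5, 7, 9, 11)

# Manual calculation: sum of the values divided by the number of values
manual_mean <- sum(x) / length(x)

# Built-in function
builtin_mean <- mean(x)

manual_mean   # 6
builtin_mean  # 6
```

Both lines give the same answer because `mean()` carries out the same calculation as the formula above.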
The sample mean is denoted as \(\bar{x}\). The population mean is denoted \(\mu\). They are calculated the same way.
The sample mean is considered to be a good point estimate of the population mean if the sample is representative of the study/target population.
What makes for a good sample?
The median is the middle value when the data are sorted in order.
When the number of observations \(n\) is odd, the median is the single middle value of the sorted data.
When the number of observations \(n\) is even, the median is calculated as the mean of the 2 middle values.
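The odd and even cases can be checked in R with a short sketch. The vectors `x_odd` and `x_even` are made-up examples, and the manual step is only there to show what `median()` does for an even \(n\).

```r
# Hypothetical example data
x_odd  <- c(7, 2, 9, 4, 5)       # n = 5 (odd)
x_even <- c(7, 2, 9, 4, 5, 11)   # n = 6 (even)

# Odd n: the median is the single middle value of the sorted data
sort(x_odd)     # 2 4 5 7 9
median(x_odd)   # 5

# Even n: the median is the mean of the 2 middle values
sorted_even <- sort(x_even)      # 2 4 5 7 9 11
n <- length(sorted_even)
manual_median <- mean(sorted_even[c(n / 2, n / 2 + 1)])

manual_median   # 6
median(x_even)  # 6
```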
How far is each data value from the mean?
Variance: \(s^2\), the sum of the squared differences between each observation’s value and the sample mean \(\bar{x}\) divided by \(n-1\)
Standard deviation: \(s\), the square root of the variance
Range: minimum to maximum
Interquartile Range (IQR): 25th percentile to 75th percentile
The deviation is how far each data value is from the mean. The variance, denoted as \(s^2\), is the sum of the squared deviations \(\sum_{i=1}^n (x_i-\bar{x})^2\), where \(i\) is the observation number, divided by \(n-1\).
\[ \operatorname{Variance}=s^2=\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1} \]
The standard deviation is the square root of the variance, and is interpreted in the original unit of measurement for that variable.
\[ \text{Standard Deviation}=s=\sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}} \]
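The two formulas above can be worked through step by step in R and compared with the built-in `var()` and `sd()` functions. This is a sketch using a made-up vector `x`; the intermediate names (`deviations`, `manual_var`, `manual_sd`) are chosen here for illustration.

```r
# Hypothetical example data
x <- c(2, 4, 4, 5, 7, 9, 11)
n <- length(x)

# Step 1: how far each value is from the sample mean
deviations <- x - mean(x)        # -4 -2 -2 -1 1 3 5

# Step 2: square the deviations, add them up, divide by n - 1
manual_var <- sum(deviations^2) / (n - 1)

# Step 3: the standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

manual_var  # 10
var(x)      # 10
manual_sd   # 3.162278
sd(x)       # 3.162278
```

Note that `var()` and `sd()` divide by \(n-1\), matching the sample formulas used here.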
The range of the data is the difference between the maximum value and the minimum value.
\[ \operatorname{Range}=\operatorname{max}(x)-\operatorname{min}(x) \]
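One thing to watch for in R: the `range()` function returns the minimum and maximum as a pair of values, not their difference. A short sketch with a made-up vector `x`:

```r
# Hypothetical example data
x <- c(2, 4, 4, 5, 7, 9, 11)

# range() returns the minimum and the maximum
range(x)                      # 2 11

# The range as a single number: maximum minus minimum
range_width <- max(x) - min(x)
range_width                   # 9

# diff() of the (min, max) pair gives the same result
diff(range(x))                # 9
```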
The 25th percentile of the data is called the first quartile or Q1.
The 50th percentile of the data is called the median.
The 75th percentile of the data is called the third quartile or Q3.
The difference between Q3 and Q1 is called the interquartile range or IQR.
\[ \operatorname{IQR}=\operatorname{Q3}-\operatorname{Q1} \]
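In R, quartiles can be obtained with `quantile()` and the interquartile range with `IQR()`. The sketch below uses a made-up vector `x`. R offers several methods for interpolating percentiles (the `type` argument of `quantile()`), so results computed by hand or in other software may differ slightly from R's default.

```r
# Hypothetical example data
x <- c(2, 4, 4, 5, 7, 9, 11)

# 25th, 50th, and 75th percentiles (R's default method)
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))

# unname() drops the "25%" / "75%" labels so only the numbers remain
q1 <- unname(quartiles[1])   # 4
q3 <- unname(quartiles[3])   # 8

# IQR = Q3 - Q1
manual_iqr <- q3 - q1
manual_iqr   # 4
IQR(x)       # 4
```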
The presence of outliers and/or skew in a numerical variable’s distribution affects how well summary statistics describe a distribution’s location.
The median and interquartile range are considered to be robust statistics for the numerical summary of data because they are less sensitive to skew and outliers than the mean, variance, and standard deviation.
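To see this robustness in action, the sketch below replaces the largest value in a made-up vector with an extreme outlier and compares the summary statistics before and after. The data and the helper name `summarize_num` are hypothetical, chosen only for illustration.

```r
# Hypothetical example data
x <- c(2, 4, 4, 5, 7, 9, 11)

# Replace the largest value (11) with an extreme value (100)
x_out <- replace(x, which.max(x), 100)

# Helper that collects the four summary statistics for a vector
summarize_num <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), IQR = IQR(v))
}

# One row per version of the data, one column per statistic
comparison <- rbind(original = summarize_num(x),
                    with_outlier = summarize_num(x_out))
comparison
```

In this example the median (5) and IQR (4) do not change at all, while the mean rises from 6 to about 18.7 and the standard deviation grows by more than a factor of 10.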
DATA1220-55 Fall 2024, Class 05 | Updated: 2024-09-09